Performance of Machine Learning Classifiers in Classifying Stunting among Under-Five Children in Zambia

Stunting is a global public health issue. We sought to train and evaluate machine learning (ML) classification algorithms on the Zambia Demographic Health Survey (ZDHS) dataset to predict stunting among children under the age of five in Zambia. We applied Logistic Regression (LR), Random Forest (RF), Support Vector Classification (SVC), XG Boost (XgB) and Naïve Bayes (NB) algorithms to predict the probability of stunting among children under five years of age, using the 2018 ZDHS dataset. We calibrated predicted probabilities and plotted the calibration curves to compare model performance. We computed accuracy, recall, precision and F1 for each machine learning algorithm. A total of 2327 (34.2%) children were stunted. Thirteen of fifty-eight features were selected for inclusion in the model using random forest. Calibrating the predicted probabilities improved the performance of machine learning algorithms when evaluated using calibration curves. RF was the most accurate algorithm, with an accuracy of 79% in the training and 61.6% in the testing data, while Naïve Bayes was the worst-performing algorithm for predicting stunting among children under five in Zambia using the 2018 ZDHS dataset. ML models can aid quick diagnosis of stunting and the timely development of interventions aimed at preventing stunting.


Introduction
Stunting is still one of the most serious health and welfare issues globally. In 2019, about 21.3% of children under five years of age were estimated to be stunted globally, and two out of five stunted children live in Africa [1]. Between 2000 and 2020, the global prevalence of stunting fell from about 30.3% to 22% [2]. Despite the fall in the magnitude of stunting, the prevalence of stunting has remained high in the sub-Saharan region [3]. In Zambia, the prevalence of stunting among children under five was estimated to be 35% in 2019 [4]. Prevalence of stunting is classified as very low, low, medium, high and very high if it is <2.5%, between 2.5 and 10%, between 10 and 20%, between 20 and 30% and above 30%, respectively [5].
Stunting among children is associated with short-term and long-term health and social outcomes. Mortality and morbidity are among the most common short-term effects of stunting [6,7]. Some long-term effects of childhood stunting include poor cognitive development, poor school performance, delay in motor development and poor maternal health outcomes [8][9][10].
Over the years, classical statistical models have been used to identify factors that are independently associated with stunting among children under five [11][12][13]. However, these methods tend not to be robust when the number of covariates exceeds the number of observations or when there is multicollinearity among variables. Furthermore, they follow strict assumptions about the data and the data generating process, such as the distribution of errors and additivity of parameters with linear predictors, which may not hold in real life [14]. Compared to classical models, machine learning models overcome the analytical challenges of a large number of covariates and multicollinearity, require fewer assumptions, incorporate high-dimensional data and thus allow a more flexible relationship between predictor and outcome variables [15]. These methods have been applied in predicting malnutrition using different datasets [16][17][18][19]. Furthermore, machine learning methods have been shown to be superior to classical statistical methods when solving classification problems [20].
In this study, we aimed to train, evaluate and select the best machine learning classifier for predicting stunting among children under five years in Zambia and to identify important variables in the prediction of stunting using the 2018 Zambia Demographic Health Survey (ZDHS) dataset. This model would serve as the basis for developing an intelligent model for diagnosing or predicting stunting, and features identified as important predictors of stunting would serve as variables to target when designing interventions aimed at preventing stunting among children under five in Zambia.

Data Source and Research Workflow
We utilized nutrition data from the 2018 Zambian demographic health survey (ZDHS) conducted by the Zambian Statistical Agency in collaboration with USAID. The survey was designed to be representative of the Zambian population. It employed a stratified two-stage sampling design. The strata were defined by province and residence (i.e., rural or urban); there are 10 provinces in Zambia, giving a total of 20 strata. The first stage involved selecting clusters defined as Enumeration Areas (EAs). For each stratum, EAs were selected using a probability proportional to size algorithm. In the second stage, a fixed number of households were selected from each EA using a systematic sampling technique. Details of the sampling methods are described in [4].
The ZDHS was performed per the Declaration of Helsinki and approved by an appropriate ethics committee. Ethical clearance was obtained from the Ethical Review Committee of the Ministry of Health, the University of Zambia Biomedical Ethics Committee and the Tropical Disease Research Centre Ethics Committee. The 2018 ZDHS survey was approved on the 27th of March 2018 by the TDRC ethics committee under protocol number STC/2018/6 and by the IRB on 6th March 2018 under protocol number 132989.0.000.ZM.DHS.02. Informed consent was obtained from participants before data collection. Permission to use this dataset for research was sought from the DHS program and granted on 22 March 2021, and the dataset was accessed through IPUMS [21]. All data were anonymized before the authors received the data. All methods were performed following the relevant guidelines and regulations. Figure 1 below depicts the research workflow. Pre-processing was followed by feature selection, after which the data were split 70:30. On 70% of the dataset (training dataset), the models were trained, and a model was selected and its performance evaluated, while on the remaining 30% (testing dataset), the predictive models were validated and their performance in predicting stunting was compared.

Pre-Processing
In this study, the target feature was stunting, which was defined based on the WHO standard as a height-for-age Z-score (HAZ) < −2 standard deviations (SD) [5]. The mothers' socioeconomic and demographic characteristics and feeding practices were selected as features from the ZDHS database. Instances with missing values were dropped from the analysis. Further, continuous variables were standardized to zero mean and unit variance. Then we applied ordinal and one-hot encoding to ordinal and non-ordinal categorical features, respectively.
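The pre-processing steps described above can be sketched as follows. This is a minimal illustration on toy data, assuming scikit-learn and pandas; the column names and category levels are hypothetical and are not actual ZDHS variable names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Toy records; column names and category levels are hypothetical.
df = pd.DataFrame({
    "child_age_months": [3, 14, 27, 40, 58, np.nan],
    "wealth_index": ["poor", "middle", "rich", "poor", "middle", "rich"],
    "region": ["Lusaka", "Eastern", "Lusaka", "Southern", "Eastern", "Southern"],
})

# Complete-case analysis: drop any record with a missing value.
df = df.dropna().reset_index(drop=True)

preprocess = ColumnTransformer(
    transformers=[
        # Continuous features: zero mean, unit variance.
        ("num", StandardScaler(), ["child_age_months"]),
        # Ordinal features: integer codes that respect the stated order.
        ("ord", OrdinalEncoder(categories=[["poor", "middle", "rich"]]), ["wealth_index"]),
        # Non-ordinal features: one indicator column per category.
        ("nom", OneHotEncoder(handle_unknown="ignore"), ["region"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X = preprocess.fit_transform(df)
```

Here `X` holds one standardized column, one ordinal code column and one indicator column per region, in that order.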

Feature Selection
Random Forest (RF) feature selection was used to select important features. Tulukdar (2020) recommends the use of RF feature selection when building a predictive model for malnutrition [19]. The model assigns an importance score to each feature, and features that had an importance score less than the average importance score were not included in the model. Figure 2 below shows each feature and its associated importance score.
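A minimal sketch of this selection rule is given below, assuming scikit-learn. The data are synthetic, and the number of trees and the random seed are illustrative choices rather than the settings used in the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the candidate features (not ZDHS data).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_

# Keep features whose importance is at least the average importance.
keep = importances >= importances.mean()

# The same rule expressed with scikit-learn's selector.
selector = SelectFromModel(forest, threshold="mean", prefit=True)
X_selected = selector.transform(X)
```

Sorting `importances` in descending order gives the ranking that a feature importance plot would display.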

Model Training
We split the data into a 70% training and a 30% testing dataset. We evaluated five widely used machine learning classifiers, namely Logistic Regression (LR), Random Forest (RF), Naïve Bayesian (NB), Support Vector Machine (SVM) and eXtreme Gradient Boosting (XG Boost), implemented in scikit-learn [22], to predict the probability of stunting.
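The split and model fitting could look roughly like the sketch below. It runs on synthetic data and makes several assumptions that are not stated in the text: the split is stratified on the outcome, hyperparameters are left at illustrative values, a Gaussian Naïve Bayes variant is used because the toy features are continuous, and scikit-learn's `GradientBoostingClassifier` stands in for XG Boost so that the example needs no extra package.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in: 13 features, roughly one third positive cases.
X, y = make_classification(
    n_samples=1000, n_features=13, weights=[0.66, 0.34], random_state=0
)

# 70:30 split; stratifying on the outcome is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "NB": GaussianNB(),
    # probability=True enables predict_proba via internal cross-validation.
    "SVM": SVC(probability=True, random_state=0),
    # Stand-in for XG Boost (assumption, see lead-in).
    "GB": GradientBoostingClassifier(random_state=0),
}

test_probabilities = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Predicted probability of the positive class on the held-out data.
    test_probabilities[name] = model.predict_proba(X_test)[:, 1]
```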

Logistic Regression
Logistic regression is a supervised machine learning algorithm used to solve classification problems [23]. It is a parametric method that assumes a Bernoulli distribution of the target variable and the independence of the observations [24]. Logistic regression is a common regression model used to predict class membership probabilities and is defined as:
$$\log\left(\frac{P(y = 1 \mid X)}{1 - P(y = 1 \mid X)}\right) = \beta_0 + \beta^{T} X$$
where $P(y = 1 \mid X)$ is the conditional probability of an observation being in class 1 given the covariates $X$, $\beta_0$ is the intercept, and $\beta$ is the vector of regression coefficients. The logistic regression model can be fitted using the maximum likelihood method.
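As a small worked illustration of how a fitted logistic regression turns covariates into a class-1 probability, the sketch below applies the inverse of the log-odds transformation. The function name and the coefficient values are hypothetical and are not estimates from the study.

```python
import math


def stunting_probability(x, beta0, beta):
    """Return P(y = 1 | x) for intercept beta0 and coefficient vector beta."""
    log_odds = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients for two standardized covariates.
beta0, beta = -0.65, [0.40, -0.25]
p = stunting_probability([1.0, 0.5], beta0, beta)

# Taking the log-odds of p recovers the linear predictor.
recovered_log_odds = math.log(p / (1.0 - p))
```

With all covariates and the intercept at zero, the log-odds are zero and the predicted probability is exactly 0.5.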

Random Forest
Random forest (RF) is an ensemble method consisting of a collection of tree-structured classifiers [17]. RF is used for classification, regression and dimension reduction. It is efficient even in instances where there are more variables than observations. To classify, RF builds many decision trees, and each tree makes its own classification independently. The RF then chooses the class that has the most votes.

Naïve Bayesian (NB)
Naïve Bayesian is a collection of machine learning classification algorithms built on Bayes' theorem. These algorithms are built on two main assumptions: the first is that every pair of features is independent of each other given the class label, and the second is that each feature makes an independent and equal contribution to the outcome. Though simple, the NB has high functionality [25,26]. For a binary outcome, a Bernoulli Naïve Bayesian algorithm is appropriate. The NB formula is given as:
$$P(y \mid X) = \frac{P(X \mid y)\, P(y)}{P(X)}$$
where $X$ denotes the independent predictors and $P(X)$ is the marginal probability of the predictors, also referred to as the evidence. $P(y \mid X)$ is the probability of label $y$ given predictors $X$, also referred to as the posterior probability. $P(y)$ is the probability before the evidence is seen, or the prior, and $P(X \mid y)$ is known as the likelihood.
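A short numeric illustration of Bayes' theorem for a single binary predictor follows. The prior uses the 34.2% stunting prevalence reported in this study; the two likelihood values are hypothetical and chosen only to show the arithmetic.

```python
def posterior(prior, likelihood_pos, likelihood_neg):
    """Return P(y = 1 | x) from P(y = 1), P(x | y = 1) and P(x | y = 0)."""
    # Evidence P(x): total probability of observing x across both classes.
    evidence = likelihood_pos * prior + likelihood_neg * (1.0 - prior)
    return likelihood_pos * prior / evidence


prior = 0.342          # reported prevalence of stunting
likelihood_pos = 0.60  # hypothetical P(x = 1 | stunted)
likelihood_neg = 0.40  # hypothetical P(x = 1 | not stunted)

p_stunted_given_x = posterior(prior, likelihood_pos, likelihood_neg)
```

Because the feature is more common among stunted children in this toy setting, the posterior is higher than the prior.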

Support Vector Machine
A support vector machine (SVM) is a supervised machine learning algorithm whose goal is to identify a hyperplane in n-dimensional space that maximizes the distance between the support vectors of the two class labels. SVM models are effective when there are more variables than samples and remain effective when the sample size is small. Although they are memory efficient, SVM models do not provide probability estimates directly; these are obtained through an expensive five-fold cross-validation process [27,28].

XG Boost
XG Boost, also known as eXtreme Gradient Boosting, is a decision tree-based ensemble machine learning algorithm that uses a gradient boosting framework (Friedman et al., 2000) [29]. Boosting involves combining weak classifiers to produce a powerful averaged classifier, and it is also a variance reduction technique. It can be applied to both classification and regression problems. The boosted decision trees are designed for optimal speed and improved model performance [30,31].

Model Performance Evaluation
We plotted reliability (calibration) curves to evaluate the performance of each model. We then calibrated the predicted probabilities to reflect the occurrence of stunting in the data using isotonic regression. The major strength of isotonic regression as a calibration method is that it can correct any monotonic distortion [32]. We determined the optimal probability threshold, on the calibrated probabilities, for classifying an instance as stunted or not.
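The calibration and thresholding steps can be sketched as below, assuming scikit-learn and synthetic data. The text does not state how the optimal threshold was chosen, so maximizing F1 on the training data is an assumption made here for illustration, as is the choice of Gaussian Naïve Bayes as the model being calibrated.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (not ZDHS data).
X, y = make_classification(
    n_samples=1500, n_features=13, weights=[0.66, 0.34], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Isotonic calibration fitted with 3-fold cross-validation on the training set.
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=3)
calibrated.fit(X_train, y_train)

# Choose the threshold that maximizes F1 on the training data (assumption).
train_prob = calibrated.predict_proba(X_train)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_train, train_prob)
denom = precision[:-1] + recall[:-1]  # the final pair has no threshold
f1 = np.where(denom > 0, 2 * precision[:-1] * recall[:-1] / np.maximum(denom, 1e-12), 0.0)
best_threshold = thresholds[np.argmax(f1)]

# Apply the calibrated model and threshold to the held-out data.
test_prob = calibrated.predict_proba(X_test)[:, 1]
y_pred = (test_prob >= best_threshold).astype(int)

# Points for a reliability curve: observed fraction vs. mean predicted probability.
frac_positive, mean_predicted = calibration_curve(y_test, test_prob, n_bins=10)
```

Plotting `mean_predicted` against `frac_positive` and comparing with the diagonal shows how far the model is from perfect calibration.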
We used 3-fold cross-validation on the training set, and the performance was estimated on the testing set. Models were evaluated based on the F1 score, Cohen's kappa, the area under the precision-recall curve (AUC-PR) and the sensitivity and specificity of each model. Data analysis was conducted using Python version 3.10.2 [33]. Data were summarized using proportions, and a chi-squared test of independence was used to test for any association between stunting and participant characteristics. Statistical significance was set at a p-value < 0.05.
The F1 score is the harmonic mean of the precision and recall of the model and is calculated using the formula below:
$$F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}, \qquad \text{precision} = \frac{tp}{tp + fp}, \qquad \text{recall} = \frac{tp}{tp + fn}$$
where recall, also known as sensitivity, is the proportion classified as positive among all the positive instances in the dataset. Precision is the proportion of true positive instances among the instances that the model has predicted as positive. True positive (tp) is the number of positive instances that are classified as positive by the model. False positive (fp) is the number of negative instances that are classified as positive by the model. False negative (fn) is the number of positive instances that are classified as negative by the model.
Cohen's kappa score is a metric used to measure inter-rater agreement. Cohen's kappa takes into account agreement that may exist between two measures due to chance, and this is one of the reasons that make it a robust measure for evaluating classification models.
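The metrics above can be computed directly from the four cells of a confusion matrix, as in the following sketch. The counts are made up for illustration and are not results from this study; the function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return precision, recall, F1 and Cohen's kappa from confusion-matrix counts."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    # Observed agreement between predictions and true labels.
    observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (observed - expected) / (1 - expected)
    return precision, recall, f1, kappa


# Illustrative counts only.
precision, recall, f1, kappa = classification_metrics(tp=40, fp=20, fn=10, tn=30)
```

For these counts the accuracy is 0.70, but half of that agreement is expected by chance, which is why kappa is lower than accuracy.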

Characteristics of Participants
In our dataset, there were a total of 6799 children under five years. Of these, 3421 (50.3%) were male, and 5253 (77.3%) were aged 12 months and above (Table 1). A child's age, wealth index, region and gender were associated with stunting. About 37.6% of children aged between 12 months and 59 months were stunted compared to 22.6% of those aged less than 12 months. A total of 38.2% of children born to a mother without any formal education were stunted. Nearly half of the children (48%) were from poor families, and 38.3% of these were stunted. The prevalence of stunting was 34.2%.
Figure 2 shows the importance score for each feature in the dataset, in descending order. The child's age in months had the highest importance score, while having an improved toilet facility had the lowest. Only 13 out of 58 features were included. The features selected as predictors of stunting among children under five were the child's gender, mother's age, age of household head, child's age (months), number of sleeping rooms in the household, number of women between 15 and 49 in the household, the birth interval in months, shared toilet, mother's current employment status, years of education, number of children under five in the household and the total number of children ever born to a mother.

Comparison of Efficiency of Machine Learning Algorithms
Children 2022, 9, x FOR PEER REVIEW 14 of 19
Figure 3 shows the calibration curves for each machine learning algorithm, with the average predicted probability for each bin on the x-axis and the fraction of positive classes in each bin on the y-axis, and below it the count of positives in each bin. All five models showed divergence from the perfectly calibrated line (Figure 3). This implies the need to calibrate the predicted probability distribution. SV Classification was the worst-performing model in all bins. After calibrating the predicted probabilities, all models improved and were better aligned with the perfect calibration curve, except for the Naïve Bayes model, which was dropped because it was inaccurate even after calibration (Figure 3).
We present the predictive performance of each classifier on the training and test datasets (Table 2). Logistic regression was the least accurate model on both the training and the test dataset, with an accuracy of 44.7% and 45.9%, respectively, whereas the Random Forest model was superior, with a training and testing accuracy of 79.2% and 61.6%, respectively. The Random Forest model had the largest F1 score in the training dataset and the smallest in the testing dataset, while logistic regression had the smallest F1 score in the training dataset but the highest in the testing dataset. Logistic regression had a Cohen's kappa score of 0.07 and 0.08 in the training and testing dataset, respectively. Random forest had a Cohen's kappa score of 0.55, the highest in the training dataset, and 0.178, the second largest in the testing dataset (Table 2). We show the average precision and recall for each machine learning classifier (Table 3). XG Boost had precision and recall of 60%, and logistic regression had a precision of 58% and a recall of 55%. Random forest and SV classification had 58% and 59% precision and recall.

Discussion
In our study, we identified the random forest model as the model with the highest predictive accuracy for stunting among children under five years in Zambia using the ZDHS 2018 data. Although Random Forest and XG Boost performed better than the traditional logistic regression, logistic regression retains interpretability as its main advantage over the other ML algorithms. Similar studies used ML algorithms to predict the nutritional status of children using demographic health survey data [18][19][20][32]. Our results are similar to the findings of [19], which identified the RF algorithm as a superior predictor of stunting.
Further, we identified features that are important in predicting stunting among children under five in Zambia. Some of the identified features, such as mother's education, mother's age, age of the child, family size and residence, have commonly been identified as predictors of stunting [11,16,34,35]. We suggest collecting more features for predicting stunting rather than relying only on the variables collected in the ZDHS. Though powerful, the ML models have a limitation in that they do not come with odds ratios or coefficients to indicate the direction of the relationship of the important features. Knowing the direction of the association of each important feature would enhance the design and implementation of interventions aimed at preventing stunting among children under 5 years.
Despite being a widely used measure for stunting, HAZ (Z-score < −2) is an arbitrary cut-off for stunting, which may have little clinical significance [36]. Our study presents an opportunity to look at individualized risk assessment for stunting among children. Further, our study took into consideration key social, economic and environmental factors that may be key determinants of a child's nutrition status compared to HAZ, which only takes into consideration the age, height and weight of children, which is one major strength of the ML algorithm in predicting health outcomes [37].
The major strength of this study is its use of the ZDHS dataset; the ZDHS applies a robust sampling method and, as such, is representative of the under-five population of children from both rural and urban parts of Zambia. Despite applying optimized machine learning classification models, the study only applied five commonly used algorithms, yet more algorithms exist, such as the ones used in [38]. Machine learning algorithms use features based on their importance or contribution to the model and not necessarily their causal effect.
We recommend further research on developing a prognostic model for stunting using longitudinal data. This would help in the development of timely interventions aimed at preventing stunting. Although the ZDHS dataset is very representative of the Zambian population of children under five, it may not contain features that would be very instrumental in building a predictive model for stunting, because the data were not collected specifically for a study of this nature. Machine learning can aid the timely diagnosis of stunting and the timely design, evaluation and deployment of mitigation measures. Since the effects of stunting are irreversible in the long run, having a machine learning-based risk score would aid the treatment of malnutrition.

Conclusions
The results suggest that the Random Forest machine learning algorithm has the highest predictive accuracy for stunting compared to other models applied in this study. We also identified the children's and their mothers' social and economic features that are important predictors of stunting among children under five.